Data Science

Review & Future Directions

2014-04-07
Instructor: Alessandro Gagliardi
TA: Kevin Perko

Agenda

  1. Review
    • What we covered
      • Data
      • Science
    • What we didn't
      • Time Series Analysis
      • Network Analysis
  2. Future Directions
    • Types of Data Science
      • 4, 5, 8 types of data scientists
    • From Hacker to Operator
      • Experience
      • Mentorship

Review

What we covered:

Data

  1. Big Data
    • Hadoop
    • IPython.parallel & StarCluster
  2. APIs
    • Twitter
    • JSON
  3. Relational Databases
    • SQL
    • 1st, 2nd, and 3rd Normal Form
  4. Feature Vectors
    • Data Frames
    • Term-Document Matrices
  5. Visualization
    • ggplot2

Science

  1. Regression
    • Linear Regression
  2. Classification
    • Logistic Regression
    • k-Nearest Neighbors
    • Decision Trees
    • Artificial Neural Networks
    • Support Vector Classifiers
  3. Dimensionality Reduction
    • Principal Component Analysis
  4. Clustering
    • k-Means Clustering

What we didn't:

  • Pig
  • Hive, Impala/Presto
  • Spark, Shark
  • Natural Language Processing
  • Time Series Analysis
  • Network (i.e. Graph) Analysis
  • Interactive Visualization (e.g. D3.js)
  • ...

Briefly:

Time Series Analysis

The main difference between time series analysis and other forms of analysis is that the instances are not independent of one another (e.g. sales on day $t$ may be related to sales on day $t-1$, while sales by clerk $x$ are (hopefully) independent of sales by clerk $x-1$).
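
One way to see this dependence is the lag-$k$ autocorrelation: the correlation between a series and a copy of itself shifted by $k$ steps. The sketch below is a minimal pure-Python version; the function name `autocorrelation` and the toy `daily_sales` numbers are made up for illustration.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]          # deviations from the mean
    denom = sum(d * d for d in dev)           # total sum of squares
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom

# Hypothetical data: a steadily rising series is positively autocorrelated.
daily_sales = [1, 2, 3, 4, 5]
lag1 = autocorrelation(daily_sales, lag=1)
```

A value near zero at every lag would suggest the instances behave as if independent; values far from zero are what time series models try to exploit.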

Autoregression

The notation AR($p$) refers to the autoregressive model of order $p$. The AR($p$) model is written

$$ X_t = c + \sum_{i=1}^p \varphi_i X_{t-i}+ \varepsilon_t .\, $$

where $\varphi_1, \ldots, \varphi_p$ are parameters, $c$ is a constant, and the random variable $\varepsilon_t$ is white noise.
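
As a rough sketch of the recursion above, the following pure-Python function simulates an AR($p$) series. The name `simulate_ar`, the Gaussian white noise, and the zero start-up values are assumptions made for illustration.

```python
import random

def simulate_ar(phi, c=0.0, n=200, sigma=1.0, seed=0):
    """Simulate X_t = c + sum_i phi_i * X_{t-i} + eps_t (illustrative helper)."""
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p                      # assumed start-up values, all zero
    for _ in range(n):
        eps = rng.gauss(0.0, sigma)    # white noise, assumed Gaussian
        # x[-1] is X_{t-1}, x[-2] is X_{t-2}, and so on
        x.append(c + sum(phi[i] * x[-1 - i] for i in range(p)) + eps)
    return x[p:]                       # drop the start-up values

series = simulate_ar([0.6, -0.2], c=1.0, n=500)
```

Setting `sigma=0.0` removes the noise and leaves only the deterministic part of the recursion, which is a handy way to check the coefficients by hand.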

Moving Average

The notation MA($q$) refers to the moving average model of order $q$:

$$ X_t = \mu + \varepsilon_t + \sum_{i=1}^q \theta_i \varepsilon_{t-i}\, $$

where $\theta_1, \ldots, \theta_q$ are the parameters of the model, $\mu$ is the expectation of $X_t$ (often assumed to equal 0), and $\varepsilon_t$, $\varepsilon_{t-1}$, ... are, again, white noise error terms.

The moving-average model is essentially a finite impulse response filter with some additional interpretation placed on it.
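
To make the filter interpretation concrete, here is a small sketch that applies the MA($q$) formula to a given noise sequence. The function name `ma_filter` is hypothetical. Feeding it a single unit impulse returns the filter taps $1, \theta_1, \ldots, \theta_q$ followed by zeros, which is what "finite impulse response" means.

```python
def ma_filter(eps, theta, mu=0.0):
    """Apply X_t = mu + eps_t + sum_i theta_i * eps_{t-i} to a noise sequence.

    Output starts at t = q so that every term eps_{t-i} is available.
    """
    q = len(theta)
    taps = [1.0] + list(theta)         # weights on eps_t, eps_{t-1}, ..., eps_{t-q}
    return [mu + sum(w * eps[t - j] for j, w in enumerate(taps))
            for t in range(q, len(eps))]

# A unit impulse preceded by q zeros: the response dies out after q steps.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
response = ma_filter(impulse, theta=[0.5, 0.25])
```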

ARMA model

The notation ARMA($p, q$) refers to the model with $p$ autoregressive terms and $q$ moving-average terms. This model contains the AR($p$) and MA($q$) models as special cases ($q = 0$ and $p = 0$, respectively):

$$ X_t = c + \varepsilon_t + \sum_{i=1}^p \varphi_i X_{t-i} + \sum_{i=1}^q \theta_i \varepsilon_{t-i}.\,$$

*(from [Wikipedia](http://en.wikipedia.org/wiki/Autoregressive_moving_average))*
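
A minimal sketch of the ARMA($p, q$) recursion, assuming all pre-sample values of $X$ and $\varepsilon$ are zero. The function name `arma` is made up for illustration; the noise is passed in explicitly so the result is easy to check by hand.

```python
def arma(eps, phi, theta, c=0.0):
    """X_t = c + eps_t + sum_i phi_i X_{t-i} + sum_i theta_i eps_{t-i}.

    Terms that would reach before the start of the series are treated as 0.
    """
    x = []
    for t, e in enumerate(eps):
        ar = sum(phi[i] * x[t - 1 - i]
                 for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[i] * eps[t - 1 - i]
                 for i in range(len(theta)) if t - 1 - i >= 0)
        x.append(c + e + ar + ma)
    return x

impulse = [1.0, 0.0, 0.0, 0.0]
both = arma(impulse, phi=[0.5], theta=[0.5])   # ARMA(1, 1)
ar_only = arma(impulse, phi=[0.5], theta=[])   # reduces to AR(1)
ma_only = arma(impulse, phi=[], theta=[0.5])   # reduces to MA(1)
```

The AR part makes the impulse decay geometrically without ever reaching exactly zero, while the MA part alone cuts off after $q$ steps.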

Extensions include:

  • ARIMA - Autoregressive integrated moving average
  • ARMAX - Autoregressive–moving-average with exogenous inputs
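
The "integrated" part of ARIMA refers to differencing the series $d$ times before fitting an ARMA model, which removes trends that would otherwise break the stationarity assumption. A sketch of that differencing step (the helper name `difference` is hypothetical):

```python
def difference(series, d=1):
    """Apply first differencing d times: y_t = x_t - x_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

quadratic_trend = [1, 4, 9, 16, 25]
once = difference(quadratic_trend, d=1)    # linear trend remains
twice = difference(quadratic_trend, d=2)   # constant: trend removed
```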

Other time series methods include:

  • Time Series Principal Component Analysis

Briefly:

Network Analysis

Like text mining, network (or graph) analysis involves big, sparse matrices.

Graphs contain:

  • Nodes or Vertices
  • Links or Edges

Graphs can be:

  • Directed
  • Undirected
    • Undirected graphs are directed graphs where all edges are reciprocal

Directed graphs can be:

  • Cyclic
  • Acyclic

(A tree is an example of a directed acyclic graph or DAG)
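
Whether a directed graph is acyclic can be checked by repeatedly removing nodes that have no incoming edges (Kahn's algorithm): if every node can be removed this way, there is no cycle. The sketch below assumes nodes are numbered from 0 and edges are `(source, target)` pairs; the function name `is_acyclic` is made up for illustration.

```python
from collections import deque

def is_acyclic(n_nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = [0] * n_nodes
    successors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    # Start from nodes with no incoming edges.
    queue = deque(i for i in range(n_nodes) if indegree[i] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Nodes on a cycle never reach indegree 0, so they are never removed.
    return removed == n_nodes

tree = [(0, 1), (0, 2), (1, 3)]     # edges point from parent to child
loop = [(0, 1), (1, 2), (2, 0)]
reciprocal = [(0, 1), (1, 0)]       # an undirected edge written as two directed ones
```

Note that an undirected edge written as a reciprocal pair counts as a cycle under this check, which is why the cyclic/acyclic distinction is drawn for directed graphs.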

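...becomes...

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 |   | 1 | 0 | 0 | 1 | 0 |
| 2 |   |   | 1 | 0 | 1 | 0 |
| 3 |   |   |   | 1 | 0 | 0 |
| 4 |   |   |   |   | 1 | 1 |
| 5 |   |   |   |   |   | 0 |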
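
Reading the listing above as the upper triangle of an adjacency matrix for an undirected graph on six nodes (an assumption, since the original figure is not reproduced here), the same matrix can be rebuilt from an edge list. The helper name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(n_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from 1-based undirected edges."""
    a = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        a[u - 1][v - 1] = 1
        a[v - 1][u - 1] = 1   # undirected: every edge is reciprocal
    return a

# Edges read off the 1-entries of the listing above.
edges = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]
A = adjacency_matrix(6, edges)

# Because A is symmetric, the part above the diagonal carries all the information.
upper = [A[i][i + 1:] for i in range(5)]
```

For large graphs most entries are 0, so in practice only the edge list (or a sparse matrix format) is stored.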

Small-World Networks

A network where the typical distance $L$ between two randomly chosen nodes (the number of steps required) grows proportionally to the logarithm of the number of nodes $N$ in the network, that is:

$$ L \propto \log N$$
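
The typical distance $L$ can be estimated as the mean shortest-path length, computed with a breadth-first search from every node. The sketch below uses hypothetical helpers `mean_path_length` and `ring`. The ring with one added chord is not a proper small-world model; it only illustrates how a single shortcut shrinks $L$.

```python
from collections import deque

def mean_path_length(neighbors):
    """Mean shortest-path length over ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in neighbors:
        dist = {source: 0}
        queue = deque([source])
        while queue:                       # breadth-first search from source
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1             # exclude the source itself
    return total / pairs

def ring(n):
    """Ring lattice: each node is linked to its two immediate neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

lattice = ring(8)
shortcut = ring(8)
shortcut[0].add(4)                         # one long-range link across the ring
shortcut[4].add(0)

L_lattice = mean_path_length(lattice)
L_shortcut = mean_path_length(shortcut)
```

In a plain ring, $L$ grows linearly with $N$; a small number of long-range links is what brings it down toward the logarithmic growth described above.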

Future Directions

Categories of data scientists

*(according to [Vincent Granville](http://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists))*

  • Those strong in statistics: they are expert in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, modeling, predictive modeling and other related techniques.
  • Those strong in mathematics: NSA or defense/military, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization).
  • Those strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, API's, Analytics as a Service, optimization of data flows, data plumbing.
  • Those strong in machine learning / computer science (algorithms, computational complexity)
  • Those strong in business, ROI optimization, decision sciences, involved in some of the tasks traditionally performed by business analysts in bigger companies (dashboards design, metric mix selection and metric definitions, ROI optimization, high-level database design)
  • Those strong in production code development, software engineering (they know a few programming languages)
  • Those strong in visualization
  • Those strong in GIS, spatial data, data modeled by graphs, graph databases
  • Those strong in a few of the above.

*(according to [Tomasz Tunguz](https://www.linkedin.com/today/post/article/20131002174328-4444200-which-of-the-five-types-of-data-science-does-your-startup-need))*

  1. Quantitative, exploratory data scientists tend to have PhDs and use theory to understand behavior. Varian’s team researches the advertiser dynamics within the ads auction and compares those dynamics to theoretical auction models like the Vickrey auction. By combining theory and exploratory research, these data scientists improve products.
  2. Operational data scientists often work in the finance, sales or operations teams at Google. In the AdSense ops . . . a star data analyst who each week would discuss our team’s performance: our email response times, the satisfaction scores of our publishers, and changes in publisher behavior by segment. His work provided a feedback loop to improve the team’s tactics and efficiency.
  3. Product data scientists tend to belong to product management or engineering. PMs and engineers sift through logs and analysis tools to understand the way users interact with a product and leverage that knowledge to refine the product. At Google, the ads quality team analyzed user clicks data to improve ad targeting.
  4. Marketing data scientists segment the user base, evaluate the performance of advertising campaigns, match product features to customer segments, and design content marketing campaigns. The marketing data scientist creates awareness and leads for the sales team, helping generate revenue.
  5. Research data scientists create insights as a product. Nate Silver is arguably the most famous of them. Silver’s work doesn’t influence a product; the analysis is the product itself. Sometimes the data science leads to a thought leadership whitepaper, or a blog post, or a financial report.

*(emphasis mine)*

*(according to [Harlan D. Harris](http://strata.oreilly.com/2013/06/theres-more-than-one-kind-of-data-scientist.html))*

  • Data Businesspeople are the product and profit-focused data scientists. They’re leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.
  • Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies.
  • Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called “big data”.
  • Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.

*(according to [Brendan Tierney](http://www.oralytics.com/2013/03/type-i-and-type-ii-data-scientists.html))*

  1. The Type I Data Scientist specializes in...

    • Statisticians
    • Data Miners
    • Predictive Modellers
    • Machine Learning
    • Data Warehousing
    • Business Intelligence & Visualization
    • Big Data
    • R / Oracle / SAS / SPSS / etc.
  2. The Type II Data Scientist approaches the types of problems that organisations are facing in a different way. They will concentrate on the business goals and business problems that the organisation are facing. Based on these they will identify what the data scientist project will focus on, ensuring that there is a measurable outcome and business goal. The Type II Data Scientist will be a good communicator, being able to translate between the business problem and the technical environment necessary to deliver what is needed. During the project the data science team will discover various insights about the data. The Type II Data Scientist will prioritise these and feed them back to the various business units. Some of these insights can range from something new, verifying business knowledge beliefs, areas where better data capture is needed, improvements in applications, etc.

Levels of Data Science

*(according to [Steve Jones](http://service-architecture.blogspot.com/2014/03/what-are-types-of-data-scientist.html))*

  1. Data Science Bluffers: They are the people who get a spreadsheet with a bunch of data, apply a very basic statistical function and claim 'Hey its Data Science'.
  2. Data Hackers: 'the one eyed man in the kingdom of the blind'...people with a bit of skill, maybe a bit of training, but they aren't at the level of sophistication of an operator...don't mistake knowing how to apply one Machine Learning technique for actual knowledge.
  3. Data Operators or Resident Data Scientists take predefined algorithms, statistical or machine learning, and then apply them to a specific company scenario and most crucially keep the parameters up to date so the algorithm continues to perform.
  4. Data Magicians normally have mathematical or physics centric PhDs (often several), often focused in specific areas such as fluid dynamics, economics or super specific such as wind-turbines... The reason its science is because its testable and provable. They can show that their algorithm would have produced 5% improvement in performance over the past 5 years, and as it moves forward show how their approach has made a difference to the performance of a business.

You are probably at the "Data Hacker" level right now and should strive to become an Operator or Resident Data Scientist. (I'm at the Operator level and striving to become a Magician.)

Q. How do I go from Data Hacker to a Data Operator or Resident Data Scientist?

  1. Experience - Challenge yourself with projects just outside of your ability and learn the techniques you need to achieve it.
  2. Mentorship - Ideally someone in your organization invested in your growth.
    • Also: the greater Data Science community

Tools to keep an eye on:

  • D3.js
  • Spark
  • Julia

Useful Languages (besides Python, R, and SQL)

  • Julia
  • Scala
  • Pig
  • JavaScript
  • Java (sorry)

Questions?